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ABSTRACT 

The field of gene expression analysis continues to 
benefit from next-generation sequencing generated 
data, which enables transcripts to be measured with 
unmatched accuracy and resolution. But the high- 
throughput reads from these technologies also 
contain many errors, which can compromise the 
ability to accurately detect and quantify rare tran- 
scripts. Fortunately, techniques exist to ameliorate 
the affects of sequencer error. We present 
RecountDB, a secondary database derived from 
primary data in NCBI's short read archive. 
RecountDB holds sequence counts from RNA-seq 
and 5' capped transcription start site experiments, 
corrected and mapped to the relevant genome. Via a 
searchable and browseable interface users can 
obtain corrected data in formats useful for 
transcriptomic analysis. The database is currently 
populated with 2265 entries from 45 organisms 
and continuously growing. RecountDB is publicly 
available at: http://recountdb.cbrc.jp. 

INTRODUCTION 

Sequencing-based transcriptome analysis has been a key 
area of biological inquiry for over a decade (1). In recent 
years, next-generation sequencing (NGS) technologies 
have revolutionized transcriptomics by providing compre- 
hensive, high-resolution measurement of the transcripts in 
a given biological sample [reviewed in (2,3)]. For example, 
RNA-seq has been used to uncover new alternative 
splicing events and discover new genes (4). Furthermore, 
5' capping methods allow exact positional detection of 
transcription start sites and absolute quantification of 
transcripts (5). 

Despite such benefits, it has been recognized that high- 
throughput reads from NGS technologies also contain 



substantial error, for Illumina typically ranging from 
0.3% at the beginning of reads to 3.8-25% at the end 
(6); and that without proper attention this error can 
cause significant artifacts (7,8). 

Unfortunately, simply removing sequences that fail to 
map well to the genome does not solve this problem, as 
many genomes (e.g. mammalian) are repetitive enough 
that erroneous sequences may infortuitously map well 
to a similar sequence in a different part of the 
genome. Figure 1 illustrates how the sequencing error 
combined with a high dynamic range thwarts simple 
count threshold-based quality control. 

To address this problem, many tools have been de- 
veloped for sequence correction. One class is aimed for 
genome sequencing (9-12). They assume the true se- 
quences have roughly uniform abundance and identify er- 
roneous sequence by their rareness. This approach is not 
suitable for transcriptome sequencing; where interesting 
transcripts (e.g. of transcription factors) often have low 
copy number. Recent extensions of these methods loosen 
the assumption of uniformity (13-15) but have not been 
evaluated for transcriptome analysis. 

Fortunately two correction tools have been designed for 
transcriptome analysis: FreClu (7) and recount (16). 
recount is designed to correct sequence count biases 
[including for those sequences which should have a zero 
count (17)] resulting from sequencing error in Solexa/ 
Illumina reads. It uses a probabilistic model to estimate 
the true expression reads based on their counts and quality 
scores, without using a reference genome (Figure 1). On a 
mouse, Solexa/Illumina, 5' capped transcriptome data set, 
recount increased the count of genome mappable se- 
quences by 13.85%, and in some cases the correction 
qualitatively changed the biological conclusions drawn 
from the data (16). Although FreClu and recount use 
different algorithms, in our evaluation their overall cor- 
rection was roughly comparable, but recount uses much 
less memory and made fewer large sequence count errors 
when applied to a simulated data set (16). 
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Figure 1. Effect of sequencing error on the sequence counts. Gray bars 
represent misread sequences, (a) True counts of input sequences; 
aaact and gactt are 3 x 10 5 and 600, respectively, (b) Output of 
sequencer. Due to sequencer error, misread sequences appear around 
each input sequence, like crumbs fallen off a cake. Unfortunately, the 
misreads of highly abundant sequence can have higher count than 
correct reads of rare sequences. Thus it is in general not possible to 
separate true and false sequences with a simple threshold, recount uses 
a probabilistic model to approximately infer (a) from (b). 



The NCBI sequence read archive (SRA) plays an im- 
portant role in preserving experimental data generated 
from NGS technologies for further studies (18). 
Typically gene expression levels are obtained from this 
data by first mapping reads to the reference genome, 
and then using gene annotation resources such as 
Aceview (19) or RefSeq (20) to obtain the expression 
levels of known genes. As discussed above, it is highly 
desirable to apply sequence count correction before 
mapping, and convenient for biologists if this can be 
done for them. For these reasons we developed 
RecountDB, in which we provide precomputed results of 
mapping sequences with last (21) after sequence correc- 
tion count with recount (16), for RNA-Seq and TSS-Seq 
experiments held in the SRA. RecountDB provides the 
mapped data in the de facto standard formats: PSL (22) 
and SAM/BAM (23), which can be directly forwarded to 
visualization and analysis software. Below we describe the 
database source, content and derivation in detail. 



DATABASE CONTENTS AND FORMATS 

RecountDB has been publicly available since June 2010. 
The current version contains 2265 entries from 45 organ- 
isms, with read lengths from 17 bp to 100 bp. Figure 2 
shows screen shots of a RecountDB keyword search. 
Results can be downloaded in three formats: TAB, PSL 
and SAM/BAM. Below we briefly describe each format. 

TAB format 

RecountDB provides a simple format that we call the 
'TAB format'. This simple format holds one sequence 
per line, each line consisting of three tab separated 
fields: the sequence itself, its observed counts and its cor- 
rected count. This format provides minimal information 



for programmers who want to patch recount correction 
into their own genome mapping protocol. 

PSL format 

The PSL (Pat Space Layout) format, used by BLAT and 
other UCSC tools (22), is a tab delimited format consist- 
ing of an optional header section and an alignment 
section. The alignment section consists of 21 fields docu- 
mented at the UCSC site http://genome.ucsc.edu/FAQ/ 
FAQformat.html#format2L. 

RecountDB provides alignments with 14 organisms in 
UCSC genomes that can be directly uploaded to the 
UCSC Genome browser (24) as a track. This enables 
biologists to visualize the transcriptome data together 
with the extensive annotation on gene structure, regula- 
tory elements etc., available in other tracks. 

SAM/BAM format 

RecountDB also provides data in BAM format, a com- 
pressed binary version of the SAM (Sequence Alignment/ 
Map) format (23). Many NGS analysis tools work with 
this format. BAM can be easily converted to SAM 
format with SAMtools (http://sourceforge.net/projects/ 
samtools/files/) (23). This format contains 11 mandatory 
fields important for alignment including the sequence and 
quality score information, which is not included in the 
PSL format. 

One feature of the SAM format definition is that it 
allows extension through the addition of extra fields. We 
take advantage of this by adding the observed and cor- 
rected sequence counts as the 14th and 15th fields 
(Figure 3). 



COMPUTATIONAL PROTOCOL 

The NCBI SRA repository holds (NGS) data sets from 
various experiments. For our purposes we used only data 
from TSS-seq and RNA-seq experiments. These data sets 
are required to be submitted in FASTQ format, which 
encodes the quality of each base in the reads. We make 
use of the quality scores both for sequence count correc- 
tion and during the mapping process. 

Error Correction with recount 

recount (16) adopts the count correction method 
proposed by BeiBbarth et al. (25), which is based on a 
probabilistic model which assumes that the probability 
of any particular sequence s being misread as some 
other sequence r is given along with the observed 
sequence counts. Conceptually, a directed weighted 
graph G{ V, E) is defined in which the vertices V are the 
possible sequences and the edge weight from sequence 
s to r represents the probability of misreading s as r. As 
outlined below, the probabilities are automatically derived 
from quality score information. To keep the computa- 
tion manageable two approximations are used: (i) se- 
quences with zero observed counts are excluded from 
consideration, (ii) edges representing very unlikely 
misread events are omitted, recount can do this approxi- 
mately by a hamming distance threshold or more 
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Figure 2. RecountDB's search interface, (a) The snapshot of the RecountDB entry page. Users can perform searches using keywords such as genome 
name, or type of study, or NCBI-SRA file ID. A browseable interface can also be accessed through the link in this page, (b) A typical RecountDB 
keyword search result page. Each entry contains basic information such as data submitter, type of study and sample source. The results are presented 
in three formats: TAB, PSL and SAM/BAM (see main text for explanation). The link in depicted as a globe symbol allows users to reach the 
NCBI-SRA primary site for the data, where the user can access the original FASTQ file. 
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Figure 3. Additional fields provided in the RecountDB SAM format 
data. OC refers to observed count, and EC estimated count (after cor- 
rection), 'f refers to type of value (float) and is followed by the values 
of each type of count. 

rigorously by a probability threshold. Currently 
RecountDB holds results for hamming distance one. 
Following BeiBbarth et al. (25), recount uses a kind of 
Expectation-Maximization (EM) (26) procedure to infer 
the set of true counts that locally maximizes the likelihood 
of the observed reads. 

To describe how recount derives misread probabilities; 
we define the consensus sequence of a read as the sequence 
called from that read (i.e. the concatenation of the most 
probable base at each position, using alphabetical order to 
break ties), and the reads of a sequence s as the set of reads 
whose consensus sequence is s. Recall that for any read of 
s, the quality scores give the probability that s should have 
been called as another sequence r. Therefore recount 
estimates the probability of r being misread as s, as the 
harmonic average of that quantity for all reads of 5. Note 
that although recount is capable of utilizing non-uniform 
base miscall probabilities [for example on Illumina C T 
miscalls are particularly common (6)], the FASTQ data 
format found in the SRA only gives one quality score 
for each position. Thus the contents of RecountDB 
assume miscalled bases are equally likely to be any of 
the three possible choices. 

Alignment of reads with last 

Count corrected mapping, last (http://last.cbrc.jp/) is a 
local alignment tool designed for fast and accurate 
genome alignment (21). Using it requires executing 
two programs: lastdb for constructing an index of 
the genome, and lastal for aligning queries using 
the index. The parameters listed below follow the recom- 
mendations of the last documentation. 

We indexed the reference genome of each relevant 
species using lastdb with option -mllllllO, which 



stipulates a spaced seed ignoring mismatches at every 
seventh position, a pattern suitable for aligning short 
reads. For lastal we use the parameters: -r6 -ql8 
-a21 -b9 -dl08 -el20; -r6 sets the match score to 
6, -ql8 the mismatch cost to 18, -a21 gap existence cost 
to 21 and -b9 gap extension cost to 9. The output of these 
alignments are provided in BAM format. 

Uncorrected mapping. For users interested in how 
recount changed the mapped results, we also provide 
uncorrected data in PSL format, mapped using last 
quality-score alignment. The parameters are -Ql -dl08 
-el2 0. The option -Ql tells last to expect quality score 
information in FASTQ-Sanger format. -dl08 sets the 
minimum score for gapless alignments to 108 and 
-el2 0 sets minimum score for gapped alignments to 120. 

Quality control. For both types of alignment, we only 
report alignments with high mapping probabilities. For 
this purpose we use the last script last-map- 
probs.py -sl5 0. With this parameter setting, when 
mapping 50 bp reads to the human genome we expect a 
random spurious alignment to be mistakenly reported 
only once every few thousand reads. 

Format conversion. The default output of lastal is in 
MAF format. We converted this to PSL and SAM 
format using the maf -convert .py script, from the 
last package. 



CONCLUSIONS 

Over the next years the availability of NGS transcriptome 
data will increase tremendously, covering a growing 
number of organisms, tissues, cell types and conditions. 
To obtain reliable conclusions from this data, accurate 
estimates of transcript abundance are needed. 
RecountDB can play a valuable role by providing 
count corrected RNA-Seq and TSS-Seq data in a conveni- 
ent form. 
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RecountDB is continuously updated in an automatic 
fashion. The compact representation of our data (i.e. 
TAB, PSL and the compressed BAM format) allow us 
to efficiently store the data. However, the number of 
amount of data is expected to increase very rapidly. To 
respond to this, apart from strengthening our hardware 
infrastructure we also plan to investigate the possibility 
applying more effective data compression techniques (27). 

Finally, as touched upon in the introduction, we note 
that sequence count correction is an area of intense 
research. As is true for the other methods, recount still 
has room for improvement — for example it assumes the 
quality scores are approximately accurate. In the future we 
plan to continue our efforts to improve recount and 
updated RecountDB accordingly. 
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